Red Wine Investigation

By Ravi Dayabhai

Introduction

This exploratory data analysis dives into a data set of 1,599 red wines with 11 variables describing the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). The data came in “wide” format, i.e., one wine per row, each wine described by variables (one variable per column).

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

It’s important to note the population from which this sample was drawn: exclusively Portuguese wine from the Vinho Verde region. Hence, conclusions inferred from these data are not sufficiently generalizable to all variants of red wine. We also lack information on grape types, wine brand, wine selling price, etc. These missing variables further impede our ability to make sweeping generalizations, but exploring the chemical properties we do have can still yield useful insight.

Descriptor Variables
fixed acidity: the amount of tartaric acid in the wine; most acids involved with wine are fixed or nonvolatile (i.e., do not evaporate readily)
volatile acidity: the amount of acetic acid in the wine, which at too high a level can lead to an unpleasant, vinegary taste
citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
chlorides: the amount of salt in the wine
free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
total sulfur dioxide: the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
density: the density of wine depends on its percent alcohol and sugar content, but approximates that of water
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines fall between 3 and 4 on the pH scale
sulphates: a wine additive that can contribute to sulfur dioxide (SO2) levels; it acts as an antimicrobial and antioxidant
alcohol: the percent alcohol content of the wine

I’ll begin my investigation by first summarizing our data (below) and taking a look at univariate distributions of each variable.

Motivating Question

The primary feature of interest is the quality of the wine (captured by the variable labeled quality). The other variables, various chemical properties, are supporting features. The primary question to answer is, “Which chemical properties influence the quality of red wines?”

Univariate Plots & Analysis Section

The following plots show the univariate distribution of each variable, just to get a rough sense of how each variable in this sample behaves.

##        X        fixed.acidity   volatile.acidity  citric.acid   
##  1      :   1   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  2      :   1   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  3      :   1   Median : 7.90   Median :0.5200   Median :0.260  
##  4      :   1   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  5      :   1   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  6      :   1   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  (Other):1593                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##                                                                        
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000  
## 

From these histograms (50 bins per plot), I can glean which variables look roughly symmetric and which are noticeably skewed.

For the skewed distributions, I looked at transformations to see if we could get a better sense of what’s at play. (Plots that didn’t seem to need a log transformation were omitted.)

Immediately, we see certain variables approach a more symmetric distribution: fixed.acidity, volatile.acidity, and sulphates. The other skewed variables also approached symmetry, but were not as “nicely” shaped as the aforementioned ones.

Below, I take a look at some of the more skewed distributions of a few explanatory variables to get a better sense of outliers. I don’t treat outliers in this section since outliers might actually define wines that are either delicious or disgusting, so I consider conditional outlier adjustment in the bivariate analysis below.

The variables I’ll examine further include alcohol, sulphates, chlorides, residual sugar, and total sulfur dioxide.

I transformed the data to exclude outliers that fall outside \(\pm 2\) standard deviations of the mean of a given variable’s distribution, and then compared plots of the transformed data with the original data to see how the adjusted distributions looked.
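The outlier rule described above is simple enough to sketch. Below is a Python stand-in for the logic (not the author’s R code), run on a handful of illustrative values in the style of the alcohol column:

```python
from statistics import mean, stdev

def drop_outliers(values, k=2):
    """Keep only observations within k standard deviations of the sample mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]

# Illustrative alcohol readings (% by volume); 14.9 sits far from the rest
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 10.2, 14.9, 9.5, 10.0, 10.4]
trimmed = drop_outliers(alcohol)  # 14.9 falls outside +/- 2 SD and is dropped
```

Note that the cut depends on the mean and standard deviation of the untrimmed sample, so extreme values partly mask themselves; that is one reason the conditional (per-quality) treatment later in the report is preferable.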

The plots on the left show the untransformed univariate distribution for the aforementioned variables; the plots on the right show the distributions for the same variables, but log-transformed with outliers (as defined above) removed. It’s interesting to see that virtually all ranges of values collapse (as we would expect), but the resulting shapes of the distributions change significantly (e.g., alcohol approaches a lognormal distribution once outliers are accounted for).

Again, I do not plan on removing these outliers until doing conditional quality breakouts since univariate analysis doesn’t allow me to see the proportion of qualities represented among outliers identified for a particular variable.

I created “relative quality” variables (rquality.3, rquality.2) that categorize the numeric quality scores assigned to each wine. The labeling convention follows the number of categories the quality scores are mapped to (i.e., rquality.3 breaks quality out into three categories; rquality.2 does the same, but for two).
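A sketch of the mapping (in Python rather than R), assuming the cut points implied by the group counts reported below, i.e., quality 3-4 is “Bad”, 5-6 is “Medium”, 7-8 is “Good”, with the two-way split breaking between 5 and 6:

```python
def rquality3(q):
    """Three-way relative quality: 3-4 -> Bad, 5-6 -> Medium, 7-8 -> Good."""
    if q <= 4:
        return "Bad"
    if q <= 6:
        return "Medium"
    return "Good"

def rquality2(q):
    """Two-way relative quality: scores up to 5 -> Bad, 6 and above -> Good."""
    return "Bad" if q <= 5 else "Good"
```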

One thing that is obscuring Good from Bad wines is the overwhelming number of middling wines (i.e., those rated 5 or 6).

## # A tibble: 3 x 2
## # Groups:   rquality.3 [3]
##   rquality.3     n
##        <ord> <int>
## 1        Bad    63
## 2     Medium  1319
## 3       Good   217

Collapsing quality into the relative quality variable rquality.3 does a somewhat better job of binning this concentrated, categorical response variable.

Bivariate Plots & Analysis Section

Below, I change bins = 15 (note the bias-variance tradeoff evidenced in these polygons versus the histograms above), simplify the quality ratings to categorical variables, and reproduce the above histograms to see how individual variables relate to quality (according to category).

Dividing the data into relative quality categories allowed me to condition my analysis on those categories.

I adjusted the vertical axis to show proportions rather than gross counts and changed the shape to be a frequency polygon for easier viewing. This gives, at least, an initial sense of where deviations in representation might suggest which variables are important determinants of wine quality.

Below, I explore certain variables in more detail based on the EDA done using the two initial multiplots. A cursory first glance suggests that sulphates, pH, volatile.acidity, fixed.acidity, alcohol, and citric.acid might be of interest. To double-check this, I generated a series of box-and-whisker plots overlaying a jittered scatter plot.

## # A tibble: 33 x 6
## # Groups:   rquality.3 [?]
##    rquality.3                  key        mean  median    range
##         <ord>                <chr>       <dbl>   <dbl>    <dbl>
##  1        Bad              alcohol 10.21587302 10.0000   4.7000
##  2        Bad            chlorides  0.09573016  0.0800   0.5650
##  3        Bad          citric.acid  0.17365079  0.0800   1.0000
##  4        Bad              density  0.99668873  0.9966   0.0076
##  5        Bad        fixed.acidity  7.87142857  7.5000   7.9000
##  6        Bad  free.sulfur.dioxide 12.06349206  9.0000  38.0000
##  7        Bad                   pH  3.38412698  3.3800   1.1600
##  8        Bad       residual.sugar  2.68492063  2.1000  11.7000
##  9        Bad            sulphates  0.59222222  0.5600   1.6700
## 10        Bad total.sulfur.dioxide 34.44444444 26.0000 112.0000
## # ... with 23 more rows, and 1 more variables: std.dev <dbl>

The variables identified earlier by eye-balling the frequency plots were corroborated by the box-and-whisker plots and the summary table above as having noticeably different distributions per relative quality category. These are the variables I’ll keep in mind for regressions later in this investigation.

Based on the analysis thus far, I presume that Good red wines (from the Vinho Verde region in Portugal) exhibit relatively lower volatile acidity, higher sulphate and citric acid levels, and higher alcohol content. It also seems that lower pH is favored despite higher fixed acid levels. The bivariate analysis below should give me a sense of whether all of these are good explanatory variables or if some combinations are redundant (in their ability to explain variance of wine quality).

Residual sugar, total sulfur dioxide, and chlorides aren’t easy to see in the grid, so I’ve broken each one out below.

Even when controlling for outliers (as the code above does), it doesn’t seem that these variables (on visual inspection) are compelling determinants of wine quality.

I want to get a sense of how “clustered” wines might be (for each variable) by a wine’s relative quality. I hypothesize the massive “medium” wine data will exhibit more variation. I test this hypothesis below by running conditional standard deviation and range calculations on relative wine quality.
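The conditional dispersion calculation amounts to grouping by relative quality and taking the range and standard deviation per group. A minimal Python sketch of that group-by/summarise logic (the rows here are made up for illustration):

```python
from statistics import stdev

def dispersion_by_group(records, var, group="rquality.3"):
    """Per-group range and sample standard deviation of one variable."""
    groups = {}
    for row in records:
        groups.setdefault(row[group], []).append(row[var])
    return {g: {"range": max(v) - min(v), "sd": stdev(v)}
            for g, v in groups.items() if len(v) > 1}

rows = [
    {"rquality.3": "Bad", "pH": 3.2}, {"rquality.3": "Bad", "pH": 3.6},
    {"rquality.3": "Medium", "pH": 3.0}, {"rquality.3": "Medium", "pH": 3.4},
    {"rquality.3": "Medium", "pH": 3.3},
]
stats = dispersion_by_group(rows, "pH")
```

Note that range and standard deviation can disagree, since range reacts to a single extreme value while standard deviation averages over all of them; that distinction matters in the comparison below.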

Interestingly, my initial assertion that dispersion is higher for “medium” wines is somewhat corroborated when looking at the conditional ranges (i.e., 6 of the 11 variables show higher ranges when conditioning on “medium” quality wine). However, when looking at standard deviation, the story changes: in only two variables (free.sulfur.dioxide and total.sulfur.dioxide) do “medium” wines show higher standard deviations than wines of either “good” or “bad” quality!

This leads me to believe that outliers are, perhaps, having an outsized impact on my visual interpretation.

Just to be sure, I perform the same exercise, but I reduce the relative quality category count from 3 (“Good”, “Medium”, “Bad”) to 2 (“Good”, “Bad”). I don’t think this will help, but it is included below for the sake of exhaustiveness and to see if reduced variance helps distinguish a signal.

## # A tibble: 2 x 2
## # Groups:   rquality.2 [2]
##   rquality.2     n
##        <ord> <int>
## 1        Bad   744
## 2       Good   855

As expected, two relative quality categories aren’t enough; three seems to be the best. This is due to the massive concentration of 5 and 6 quality wines, which tend to be very similar: when these are split between two categories, they overwhelm the small number of differentiated scores (i.e., 3, 4, 7, and 8). Generally, the same variables stand out (as establishing a relationship with relative quality).

Based on the EDA of each variable above, the following chemical properties seem promising: sulphates, pH, volatile.acidity, fixed.acidity, alcohol, and citric.acid.

Interestingly, these variables also tended to show the most positive skew (see first set of histograms) versus other variables which looked to be more normally distributed.

I want to make pairwise comparisons of each of my variables, with an eye especially toward the plots of wine quality vs. [chemical property variables]. This will allow me to quickly spot-check which bivariate comparisons seem to exhibit meaningful relationships. To do this, I created a scatterplot matrix below. For clarity of relationships, I’ve also generated a correlation matrix.
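Each cell of such a correlation matrix is just a pairwise Pearson coefficient: the covariance of the two variables normalized by both standard deviations. For reference, a self-contained Python sketch of the computation:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance normalized by the two standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

The result always falls in [-1, 1], with magnitude 0.5 and above treated in this report as a “strong” linear relationship.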

We see from the matrix and the correlation “heat map” that there are a few strong relationships (proxied by a correlation coefficient of magnitude 0.5 and above) among the explanatory variables (e.g., citric acid and volatile acidity, citric acid and fixed acidity, density and fixed acidity, etc.). However, only alcohol seems to meet this threshold in relation to the response variable of interest, quality.

To get a better sense of the pairwise comparisons, I plot each of the interesting variables identified in the univariate analysis above against the dependent variable, quality.

qbivs <- ggplot(df.long2.c3, aes(x = value, y = quality)) + 
    geom_point(alpha = 0.2, position = position_jitter()) +
    geom_smooth(aes(group = 1), se = FALSE, method = "lm") +
    facet_wrap( ~ key, scales = "free")

qbivs

When we look at a faceted view of all chemical properties vs. quality, we can confirm that the variance is generally lower among the highest and lowest quality wines for most variables, including those identified as interesting above. The relationships between quality and chemical properties present in these data aren’t super clear, so below, I play with a few of the quality vs. chemical property relationships a bit more.

The strength of the relationship between alcohol and quality, at least visually, is more substantive when controlling for conditional outliers (summarized below). This is in line with the adjustments made to (unconditional) outliers in the Univariate Analysis section.

## 
##  Pearson's product-moment correlation
## 
## data:  df.alcohol.no_ols$quality and df.alcohol.no_ols$alcohol
## t = 25.877, df = 1513, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5179716 0.5878493
## sample estimates:
##      cor 
## 0.553885
## 
##  Pearson's product-moment correlation
## 
## data:  df.alcohol$quality and df.alcohol$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

For the following plots of chlorides and sulphates, I transformed the x-axis in accordance with the finding that the distributions of these variables are lognormal.

## 
##  Pearson's product-moment correlation
## 
## data:  df.sulphates.no_ols$quality and df.sulphates.no_ols$sulphates
## t = 18.839, df = 1517, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3937575 0.4753162
## sample estimates:
##       cor 
## 0.4354299
## 
##  Pearson's product-moment correlation
## 
## data:  df.sulphates$quality and df.sulphates$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

The relationship between sulphate content and quality becomes much clearer (read: the improvement from 0.25 to nearly 0.44 correlation coefficient) once conditional outliers were removed. The latter plot shows a generally positive relationship.

No new insights when cutting residual sugar conditional outliers.

There is some additional clarity for total sulfur dioxide, but it’s tough to say whether a relationship exists. The initial boxplots suggest that low total sulfur dioxide levels are only seen in markedly “good” or “bad” wines. Given our sample, this holds true for measures of central tendency at the extremes as well (“8” wines and “3” wines show the lowest levels of total sulfur dioxide), as illustrated in the next plot.

This last relationship is also substantiated a bit by controlling for conditional outliers.

There are a few specific relationships I want to delve into more, given the scatterplot matrix: free.sulfur.dioxide vs. total.sulfur.dioxide, volatile.acidity vs. citric.acid, volatile.acidity vs. fixed.acidity, and fixed.acidity vs. pH.

First, a look at the so-called “sulfur” measures.

This is an interesting plot: it exhibits obvious heteroskedasticity, and the slopes of the regression lines differ by relative quality of wine.

Here’s a residual plot showing the heteroskedasticity more clearly; notice the “fanning” pattern in the residual plot and deviations from normality in the Q-Q plot.

## 
##  Pearson's product-moment correlation
## 
## data:  df$total.sulfur.dioxide and df$free.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6395786 0.6939740
## sample estimates:
##       cor 
## 0.6676665

Doing the same analysis for other explanatory variable relationships yields similar results. Below, see the comparison of a few acidity measures.

First, volatile.acidity vs. citric.acid.

# Plot: Volatile Acidity vs. Citric Acidity
volatile.base <- ggplot(df, aes(x = citric.acid, y = volatile.acidity, 
               color = rquality.3)) + 
           geom_point(position = position_jitter(), alpha = 0.3) +
           facet_wrap(~ rquality.3 , scale = "free") +
    geom_smooth(method = "lm", se = TRUE)

volatile.base

# Correlation
cor.test(method = "pearson",  df$volatile.acidity, df$citric.acid)
## 
##  Pearson's product-moment correlation
## 
## data:  df$volatile.acidity and df$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

Second, volatile.acidity vs. fixed.acidity.

# Plot: Volatile Acidity vs. Fixed Acidity
volatile.fixed <- ggplot(df, aes(x = fixed.acidity, y = volatile.acidity, 
               color = rquality.3)) +
           geom_point(position = position_jitter(), alpha = 0.3) +
           facet_wrap(~ rquality.3 , scale = "free") +
    geom_smooth(method = "lm", se = TRUE)

volatile.fixed

# Correlation
cor.test(method = "pearson",  df$volatile.acidity, df$fixed.acidity)

Third, fixed.acidity vs. pH.

# Plot: Fixed Acidity vs. pH
fixed.pH <- ggplot(df, aes(x = pH, y = fixed.acidity,
                           color = rquality.3)) +
    geom_point(position = position_jitter(), alpha = 0.3) +
    facet_wrap(~ rquality.3, scales = "free") +
    geom_smooth(method = "lm", se = TRUE)

fixed.pH

# Correlation
cor.test(method = "pearson",  df$fixed.acidity, df$pH)
## 
##  Pearson's product-moment correlation
## 
## data:  df$fixed.acidity and df$pH
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7082857 -0.6559174
## sample estimates:
##        cor 
## -0.6829782

Because there is clear correlation among these variables, I’ll generally start by including only one of them at a time in the model-building process, since collinear combinations won’t be as significant predictors.

To recap, my initial list of variables of interest still seem compelling, but after investigating their inter-relationships, I should be wary of needlessly adding explanatory variables to a model (so as to avoid “overfitting” the model). In the next section, I attempt to fit this model, piecewise, using these identified variables to explain quality of red wines.

Multivariate Plots Section

By now, I think I’ve narrowed in on the variables I want to use to explain wine quality (in this context).

I begin by building the model using a single regressor: alcohol. I chose this because it showed the highest correlation coefficient with quality (and therefore, since this is simple linear regression, the highest coefficient of determination). I run the regression on the log transform of alcohol (informed by the univariate analysis above).
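For a single regressor, the OLS fit has a closed form worth keeping in mind when reading the summary below: the slope is cov(x, y) / var(x), and the intercept follows from the means. A Python sketch (illustrative mechanics only, not the fitted wine model):

```python
def ols_simple(x, y):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Sanity check on a perfect line y = 1 + 2x
intercept, slope = ols_simple([1, 2, 3], [3, 5, 7])
```

With a log-transformed regressor, the same formula applies to log(x), which is why the fitted slope below is read per unit of log(alcohol) rather than per percentage point of alcohol.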

## 
## Call:
## lm(formula = quality ~ log(alcohol), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8653 -0.3942 -0.1694  0.5100  2.5846 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3.4740     0.4204  -8.263 2.95e-16 ***
## log(alcohol)   3.8948     0.1796  21.687  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.71 on 1597 degrees of freedom
## Multiple R-squared:  0.2275, Adjusted R-squared:  0.227 
## F-statistic: 470.3 on 1 and 1597 DF,  p-value: < 2.2e-16

Just to see how this model might change (given what we discovered in the Univariate Analysis section) if I log-transformed alcohol and removed outliers, I rerun the regression doing just that.

## 
## Call:
## lm(formula = quality ~ log(alcohol), data = subset(df.outs, alcohol_outlier == 
##     0))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8872 -0.3691 -0.1511  0.5054  2.5886 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3.9909     0.4774   -8.36   <2e-16 ***
## log(alcohol)   4.1195     0.2049   20.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6986 on 1527 degrees of freedom
## Multiple R-squared:  0.2093, Adjusted R-squared:  0.2088 
## F-statistic: 404.3 on 1 and 1527 DF,  p-value: < 2.2e-16

As we see from the output, the regression does worse (according to the R-squared value), implying that the outliers tended to be either good or bad wines – the variance they introduced could be, on average, explained better by alcohol than by omitting them altogether.

Now, I begin to add a few more variables. Second up is volatile.acidity, because it doesn’t show much of a linear relationship with alcohol but does show the second-strongest linear relationship with our response variable, quality. Third is sulphates, transformed logarithmically as per the above exploration.

## 
## Calls:
## model1: lm(formula = quality ~ log(alcohol), data = df)
## model2: lm(formula = quality ~ log(alcohol) + volatile.acidity, data = df)
## model3: lm(formula = quality ~ log(alcohol) + volatile.acidity + I(log(sulphates)), 
##     data = df)
## model4a: lm(formula = quality ~ log(alcohol) + volatile.acidity + I(log(sulphates)) + 
##     fixed.acidity, data = df)
## model4b: lm(formula = quality ~ log(alcohol) + volatile.acidity + I(log(sulphates)) + 
##     I(log(1e-05 + citric.acid)), data = df)
## model4c: lm(formula = quality ~ log(alcohol) + volatile.acidity + I(log(sulphates)) + 
##     pH, data = df)
## 
## ===================================================================================================================
##                                   model1        model2        model3       model4a       model4b       model4c     
## -------------------------------------------------------------------------------------------------------------------
##   (Intercept)                     -3.474***     -1.563***     -1.132**      -1.488***     -1.126**      -0.441     
##                                   (0.420)       (0.416)       (0.411)       (0.437)       (0.415)       (0.478)    
##   log(alcohol)                     3.895***      3.390***      3.276***      3.328***      3.274***      3.413***  
##                                   (0.180)       (0.172)       (0.169)       (0.171)       (0.170)       (0.176)    
##   volatile.acidity                              -1.384***     -1.157***     -1.103***     -1.162***     -1.087***  
##                                                 (0.095)       (0.097)       (0.100)       (0.106)       (0.100)    
##   I(log(sulphates))                                            0.639***      0.612***      0.640***      0.614***  
##                                                               (0.077)       (0.078)       (0.077)       (0.077)    
##   fixed.acidity                                                              0.023*                                
##                                                                             (0.010)                                
##   I(log(1e-05 + citric.acid))                                                             -0.001                   
##                                                                                           (0.006)                  
##   pH                                                                                                    -0.320**   
##                                                                                                         (0.114)    
## -------------------------------------------------------------------------------------------------------------------
##   R-squared                        0.228         0.318         0.346         0.348         0.346         0.349     
##   adj. R-squared                   0.227         0.317         0.345         0.347         0.345         0.348     
##   sigma                            0.710         0.667         0.654         0.653         0.654         0.652     
##   F                              470.343       371.904       281.523       213.153       211.015       214.040     
##   p                                0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood               -1720.253     -1620.771     -1586.889     -1584.082     -1586.882     -1582.924     
##   Deviance                       805.061       710.868       681.372       678.984       681.366       678.001     
##   AIC                           3446.507      3249.541      3183.778      3180.165      3185.764      3177.848     
##   BIC                           3462.638      3271.050      3210.663      3212.428      3218.027      3210.110     
##   N                             1599          1599          1599          1599          1599          1599         
## ===================================================================================================================

Looking at the adjusted R-squared, we see that the inclusion of each of the first three variables is adding to the amount of variability explained by the model in a non-trivial way. While the strength of this explanation isn’t super great, the significance testing does give us an indication that these variables, when combined together, are useful.
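The adjusted R-squared compared here penalizes each added regressor, which is why the model4 variants barely move it. A quick Python check of the formula against the model1 figures from the first lm summary (R-squared 0.2275, n = 1599, one regressor):

```python
def adj_r2(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - k - 1) for k regressors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

value = adj_r2(0.2275, 1599, 1)  # approx. 0.227, matching the summary output
```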

The model4 variations (which add fixed.acidity, log(citric.acid), and pH to model3, respectively but not cumulatively) show that the inclusion of any of these variables does little to improve the overall model’s explanatory power.

Note: I am employing linear regression analysis here, but this might not be the best analytical approach given these data. For example, the quality variable is essentially categorical (even if expressed as integer rankings). This raises the question of whether multiple logistic regression or another approach would be more appropriate.

To see how this looks in terms of errors, the following chart shows summary statistics (via boxplots) of the conditional errors for each model, numbered 1 through 3 according to the model build-up above.

We can see the marked improvement in moving from a simple linear regression to one that involves two variables. More variables do provide more explanation, but only trivially so (hence the blue and green boxes looking so similar for any given wine quality).


Final Plots and Summary

Plot One

This is a correlation matrix “heat map” that shows the pairwise correlation coefficients for the variables in this data set. The color of each block illustrates the “intensity” of the corresponding linear relationship. I included this because it transmits a lot of information about the [potential] linear relationship between any two variables.

Plot Two

Plot 2 comes from the Bivariate Analysis section, where I compared variables of interest against one another. This plot demonstrates, using free.sulfur.dioxide and total.sulfur.dioxide, that some of the chemical properties correlate with each other. This had a bearing on variable selection (as seen in the Multivariate Analysis section). It also shows, as with other covariates, that these relationships generally hold regardless of wine quality.

Plot Three

Here we see that a simple linear regression (i.e., a model with only one regressor) can be improved upon by adding other variables to the regression equation, but the improvement reaches its explanatory limit pretty quickly (for clarity, this chart shows a two-factor “Model 2” and a three-factor “Model 3” for comparison purposes).

The three models (generated in the Multivariate Analysis section) are described below, in order:

\[ \begin{eqnarray} \hat{y}_{1} &=& 1.875 + 0.361x_{\text{alcohol}} \\ \hat{y}_{2} &=& 3.095 + 0.314x_{\text{alcohol}} - 1.384x_{\text{volatile acidity}} \\ \hat{y}_{3} &=& 3.359 + 0.303x_{\text{alcohol}} - 1.156x_{\text{volatile acidity}} + 0.641\log(x_{\text{sulphates}}) \\ \end{eqnarray} \]


Reflection

Struggles

The exploration of the red wine data set revealed interesting relationships between the response variable (quality) and the chemical property variables. It showed that several of these explanatory variables were interrelated in meaningful ways (as measured by correlation), and building a linear regression model at the end exposed some of the shortcomings of this EDA.

First, the distribution of wines overwhelmingly favored mediocre ratings (i.e., 5 or 6). This left [relatively] few data points at either end of the quality spectrum from which to draw conclusions. Another way to think about this is that the data set was “weighted” heavily to the middle, and picking out (or predicting) determinants of quality at either end of this spectrum was washed out by this “weighting.” The 10-point quality scale really collapsed into a 6-point scale, since the data did not include wines that were exceptional in either direction.

Second, these data contain only chemical properties, which excludes variables that could have more predictive bearing on wine quality. Characteristics such as grape type, season, year/vintage, or price might exhibit stronger relationships with wine quality, or a stronger relationship might emerge when conditioning on one (or several) of these unavailable variables.

Third, and likely the most important difficulty, is that I lack enough domain knowledge to properly architect a model. As a layman, I have very little understanding of the nature of these chemical properties (and have treated them as abstracted variables, devoid of contextual meaning). This is great for getting a basic sense of how variables interrelate, but not really why they do.

Finally, on a personal note, I have a long, long way to go before the mechanics of implementing EDA in R become smooth for me. If I had to guesstimate, I’d wager I spent close to 80% of my time outside of the code…time spent figuring out how to do what my mind’s eye had already envisioned. If I must be honest, I think I still prefer Python. :)

Surprises

In general, I was surprised to find that none of these chemical properties (alone or in combination) did a particularly great job explaining the variation in red wine quality. In other words, I thought that the presumed discriminating tastes of sommeliers would correlate with chemical properties of the wine in a more obvious way.

I was also surprised to see the degree to which only slight transformations of skewed data resulted in univariate distributions that approached normality, and that many distributions were somewhat normal to begin with.